Extractor API #16

Closed
wants to merge 13 commits into from

Member

@psrpinto psrpinto commented Aug 20, 2024

WIP

Doubts

Naming

Need better names for:

  • SiteData
  • SiteInfo?

What should happen if the extractor thinks it can handle the source but then can't?

I guess we'd tell the user something like "Extractor didn't find anything, try another one"? Or, instead of having the extractor say whether it supports a source, should we ask the user right away which extractor they want to use?

Should an extractor extract a single type of data?

For example, should the wordpress-rest extractor extract posts and pages, or should there be a wordpress-post-rest and a wordpress-page-rest extractor?

Having different extractors for different kinds of data would make it possible to have specialized extractors for certain data types. For example, there could be a wordpress-product-rest extractor that extracts products from an e-commerce site.

What about multi-language sources?

We don't need to support it right away, but we should keep it in mind while designing the API.

@psrpinto psrpinto changed the base branch from trunk to simplify-directory-structure August 20, 2024 16:26
@psrpinto psrpinto force-pushed the extractor-api branch 3 times, most recently from 1bfda77 to 57cc696 Compare August 21, 2024 12:54
@akirk akirk force-pushed the simplify-directory-structure branch from 486ac09 to ebd0919 Compare August 21, 2024 12:56
Member

@akirk akirk left a comment

What should happen if the extractor thinks it can handle the source but then can't?

What about having extractors report a confidence between 0 and 100 that they can extract the source?
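
A rough sketch of how a confidence score could drive extractor selection. The `Source` and `Extractor` shapes, the `rankExtractors` helper, the `generic-html` extractor and the scores are all assumptions made up for illustration; none of them come from this PR.

```typescript
// Hypothetical shapes for illustration only.
interface Source {
  url: string;
}

interface Extractor {
  name: string;
  // Confidence from 0 (cannot handle the source) to 100 (certain).
  confidence(source: Source): number;
}

// Rank extractors by descending confidence, dropping those that report 0.
function rankExtractors(extractors: Extractor[], source: Source): Extractor[] {
  return extractors
    .map((extractor) => ({ extractor, score: extractor.confidence(source) }))
    .filter(({ score }) => score > 0)
    .sort((a, b) => b.score - a.score)
    .map(({ extractor }) => extractor);
}

// Assumed heuristic: a REST-looking URL raises the WordPress extractor's score.
const wordpressRest: Extractor = {
  name: "wordpress-rest",
  confidence: (source) => (source.url.includes("/wp-json/") ? 90 : 10),
};

// Assumed catch-all extractor with a constant, middling score.
const genericHtml: Extractor = {
  name: "generic-html",
  confidence: () => 30,
};

const ranked = rankExtractors([genericHtml, wordpressRest], {
  url: "https://example.com/wp-json/wp/v2/posts",
});
```

With a ranked list, an extractor that overestimated itself and found nothing could simply be followed by the next candidate, instead of asking the user to pick another one.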

Should an extractor extract a single type of data?

I think it should not be limited to a single type of data since parsing the DOM might give multiple pieces of data as side-effects. For example, followers and follows might be on the same page and thus could be extracted at the same time.
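
As a sketch of what a multi-type result could look like: one pass over the page yields items tagged with their data type. `ExtractedItem`, `ProfileLink` and `extractRelations` are hypothetical names, and `ProfileLink` stands in for whatever the DOM parsing would actually produce.

```typescript
// Hypothetical result shape: each item carries the kind of data it holds.
interface ExtractedItem {
  type: "follower" | "follow";
  handle: string;
}

// Stand-in for a parsed page: one entry per profile link found in the DOM,
// together with the page section it appeared in.
interface ProfileLink {
  section: "followers" | "following";
  handle: string;
}

// A single pass over the links produces both kinds of data.
function extractRelations(links: ProfileLink[]): ExtractedItem[] {
  return links.map((link): ExtractedItem => ({
    type: link.section === "followers" ? "follower" : "follow",
    handle: link.handle,
  }));
}

const items = extractRelations([
  { section: "followers", handle: "@alice" },
  { section: "following", handle: "@bob" },
]);
```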

What concerns me more is how we'll register extractors, considering that we might end up with a large number of them. Maybe we can have a two-stage system where we'd only add a subset of extractors based on URL matching. WordPress, for example, could be added for all URLs except those that specifically match other matchers.

That way we wouldn't be registering all extractors on every page (since the content script loads into every page).
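
A minimal sketch of the two-stage idea, assuming a registry of entries that each carry a cheap URL pattern. `ExtractorEntry`, `selectEntries`, the `fallback` flag and the `example-social` entry are made-up names for illustration.

```typescript
// Hypothetical registry entry: the URL pattern is checked first, so no
// extractor code needs to be registered for pages that do not match.
interface ExtractorEntry {
  name: string;
  matches: RegExp;
  // When true, the entry is only used if no specific entry matched.
  fallback?: boolean;
}

function selectEntries(entries: ExtractorEntry[], url: string): ExtractorEntry[] {
  // Stage 1: entries whose pattern specifically matches this URL.
  const specific = entries.filter((e) => !e.fallback && e.matches.test(url));
  if (specific.length > 0) {
    return specific;
  }
  // Stage 2: nothing specific matched, so fall back to the generic entries.
  return entries.filter((e) => e.fallback === true && e.matches.test(url));
}

const entries: ExtractorEntry[] = [
  { name: "example-social", matches: /^https:\/\/social\.example\// },
  { name: "wordpress", matches: /^https?:\/\//, fallback: true },
];

const forSocial = selectEntries(entries, "https://social.example/@user");
const forBlog = selectEntries(entries, "https://blog.example/2024/post");
```

Here `forSocial` contains only the specific entry, while `forBlog` contains only the WordPress fallback.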

What about multi-language sources? We don't need to support it right away, but we should keep it in mind while designing the API.

👍

Comment on lines +2 to +3
* Source of data to be extracted, like a DOM document, a URL or any other kind of resource.
* For the moment, only DOM Document is supported.
Member

Could you elaborate on these other kinds of resources? The URL can be accessed through document.location.href.

Member Author

The idea behind having a Source interface is that the API does not depend on a specific data structure, which might not be available in all runtimes. If in the future we would like to run an extractor in Node.js, for example, the document would not exist (or at least would not have the same type).

Another reason would be that we can envision having extractors that don't rely on a document, but instead, for example, pull directly from a URL. (We could also make it so that an extractor can support multiple types of Sources, e.g. DOMSource and URLSource).

If we did not introduce the notion of a Source now, adding support for multiple types of sources later would be a breaking change to the API, which would require updating all existing extractors.
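
A sketch of how this could look with a tagged union, assuming the DOMSource and URLSource kinds mentioned above. The `kind` field, `DocumentLike`, `supportedKinds` and `supports` are hypothetical and not part of this PR.

```typescript
// Hypothetical minimal stand-in for the browser's Document, so the sketch
// also type-checks in runtimes that have no DOM typings.
interface DocumentLike {
  title: string;
}

interface DOMSource {
  kind: "dom";
  document: DocumentLike;
}

interface URLSource {
  kind: "url";
  url: string;
}

// New source kinds can be added to the union without changing extractors
// that only declare support for the existing ones.
type Source = DOMSource | URLSource;

interface Extractor {
  name: string;
  // An extractor may support one or several kinds of source.
  supportedKinds: Source["kind"][];
}

function supports(extractor: Extractor, source: Source): boolean {
  return extractor.supportedKinds.some((kind) => kind === source.kind);
}

const domOnly: Extractor = { name: "dom-only", supportedKinds: ["dom"] };
const both: Extractor = { name: "dom-and-url", supportedKinds: ["dom", "url"] };

const urlSource: Source = { kind: "url", url: "https://example.com/feed" };
const domSource: Source = { kind: "dom", document: { title: "Example" } };
```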

@psrpinto
Member Author

That way we wouldn't be registering all extractors on every page (since the content script loads into every page).

This is something I hadn't considered. We should probably only run the content scripts while the extension is open.

@psrpinto psrpinto deleted the branch simplify-directory-structure September 2, 2024 13:57
@psrpinto psrpinto closed this Sep 2, 2024
@psrpinto
Member Author

psrpinto commented Sep 2, 2024

I will open a new PR to implement this when required.

@psrpinto psrpinto deleted the extractor-api branch October 7, 2024 15:14